The influence of T cell development on pathogen specificity and autoreactivity

T cells orchestrate adaptive immune responses upon activation. T cell activation requires
sufficiently strong binding of T cell receptors on their surface to short peptides that are derived
from foreign proteins and bound to the protein products of major histocompatibility complex (MHC)
genes; these peptide-MHC complexes are displayed on the surface of antigen presenting cells.
T cells can also interact with peptide-MHC
complexes, where the peptide is derived from host (self) proteins. A diverse repertoire of relatively
self-tolerant T cell receptors is selected in the thymus. We study a model, computationally and
analytically, to describe how thymic selection shapes the repertoire of T cell receptors, such that T
cell receptor recognition of pathogenic peptides is both specific and degenerate. We also discuss the
escape probability of autoimmune T cells from the thymus.

INTRODUCTION

Despite constant exposure to infectious microbial 
pathogens, higher organisms are rarely sick. This is be- 
cause the innate immune system is quite successful in 
clearing pathogens before they can establish an infection. 
However, the components of the innate immune system 
respond only to common, evolutionarily conserved markers
expressed by diverse pathogens. Some bacteria and
most viruses have evolved strategies to evade or over- 
come these innate mechanisms of protection. A second 
arm of the immune system, adaptive immunity, combats 
pathogens that escape the innate immune response. The 
adaptive immune system is remarkable in that it mounts 
pathogen-specific responses against a diverse and evolving
world of microbes, for which pathogen specificity
cannot be pre-programmed. During an infection, cells 
of the adaptive immune system that are specific for the 
pathogen proliferate. Once the infection is cleared, most 
of these cells die, but some remain as memory cells which 
respond rapidly and robustly to re-infection by the same 
pathogen. This immunological memory is the basis for 
vaccination. 

T cells play a key role in orchestrating the adaptive 
immune response. They combat pathogens that have in- 
vaded host cells. Pathogen-derived proteins in infected 
host cells are processed into short peptides (p) which 
can bind to major histocompatibility (MHC) proteins 
expressed in most host cells. The resulting pMHC com- 
plexes are presented on the surface of host cells as molec- 
ular markers of the pathogen. T cells express a protein on 
their surface called the T-cell receptor (TCR). Each
TCR has a conserved region participating
in the signaling functions and a highly variable segment 
responsible for pathogen recognition. Because the vari- 
able regions are generated by stochastic rearrangement of 
the relevant genes, most T cells express a distinct TCR. 



When we say that a given T cell recognizes a particular 
pMHC complex, we mean that its TCR binds sufficiently 
strongly to it to enable biochemical reactions inside the 
T cell that result in activation and proliferation of this 
particular "pathogen-specific" T cell clone. 

The diversity of the T cell repertoire enables the im- 
mune system to recognize many different pathogenic 
pMHCs. Peptides presented on MHC class I are typically
8-11 amino acids long. TCR recognition of pMHC is
both degenerate and specific. It is degenerate because
each TCR can recognize several peptides. It is specific
because most point mutations to the recognized peptide
amino acids abrogate recognition.

Host proteins (e.g., those that are misfolded) are also
processed into short peptides, and are presented on the 
surface of cells in complex with MHC proteins. The 
gene rearrangement process ensuring the diversity of 
TCR may result in generating T cells harmful to the 
host, because they bind strongly to self pMHCs. T 
cells which bind too weakly to MHC are also not use- 
ful as they cannot interact with pathogenic peptides. 
T cells that bind too weakly or too strongly to
self pMHC molecules are therefore likely to be eliminated during
T cell development in the thymus. Immature T
cells (thymocytes) move around the thymus and interact
with a diverse set ($10^3$-$10^4$) of self pMHCs presented on
thymic cells. Thymocytes expressing TCRs that bind too
strongly to any self-pMHC are likely to be deleted (a process
called negative selection). However, a thymocyte's
TCR must also bind sufficiently strongly to at least one 
self-pMHC to receive a survival signal and emerge from 
the thymus (a process called positive selection). The 
threshold binding strength required for positive selection 
is weaker than that which is likely to result in negative 
selection. 

Signaling events, gene transcription programs, and cell 
migration during T cell development in the thymus have 
been studied extensively. An understanding of
how interactions with self-pMHC complexes in the thymus
shape the peptide binding properties of selected
TCR amino acid sequences such that mature T cells ex- 
hibit their special properties is also beginning to emerge. 
Recent experiments by Huseby et al.
provided important clues in this regard. Experiments
were carried out to contrast T cells that developed in a 
normal mouse with a diversity of self pMHC molecules 
in the thymus and those that developed in a mouse that 
was engineered to express only one type of self pMHC 
in the thymus. For T cells that developed in a normal 
mouse, pathogen recognition was found to be very sen- 
sitive to most point mutations of recognized pathogenic 
peptides. In contrast, T cells that developed in the engi- 
neered mouse were found to be much more cross-reactive. 

To address these issues, we previously studied a simple
model, where TCRs and pMHCs were represented by
strings of amino acids [15] (Fig. 1a), using numerical [16]
and analytical [13] methods. Our results provided a statistical
perspective on how T cells recognize
foreign pathogens in a specific, yet degenerate, manner.
They also provided a conceptual framework for diverse
experimental data. In this paper, we extend the
model and study new phenomena, including: 1) How do
TCR-MHC interactions differ upon development against
different numbers of peptides in the thymus, and how does this
influence T cell cross-reactivity? 2) What are the sequence
characteristics of pathogenic peptides recognized
by T cells? 3) How does stochastic escape from negative 
selection in a normal thymus influence T cell specificity 
for pathogenic peptides? 4) How does the frequency of 
autoimmune T cells change upon modulating the number 
of peptides encountered during T cell development? 



MODEL 

To describe the interactions between TCRs and 
pMHCs, we model them as strings of amino acids. These 
strings indicate the amino acids on the interface between 
TCRs and pMHCs. In the simplest incarnations of the 
model, it is assumed that each site on a TCR interacts 
only with a corresponding site on a pMHC (Fig. 1a). The
binding interface of a TCR is composed of a more con- 
served region that is in contact with the MHC molecule 
and a highly variable region that makes the majority of 
contacts with the peptide. Therefore, we explicitly con- 
sider only amino acids of the latter part of the TCR, 
but not the former. Similarly, there are many possible 
peptides that can bind to MHC, and their sequences are
considered explicitly. Prior to our work, TCR-pMHC
interactions have been represented using string models,
but these studies did not have an explicit
treatment of amino acids or consider the mechanistic
issues we did (including connections to human disease).



To assess the effects of thymic development on 
pathogen recognition characteristics, we evaluate the free 
energy of interaction between TCR-pMHC pairs. The in- 
teraction free energy is composed of two parts: a TCR 
interaction with MHC and a TCR interaction with the 
peptide. The former is given a value $E_c$, which may be
varied to describe different TCRs and MHCs. The latter
is obtained by aligning the TCR and pMHC amino
acids that are treated explicitly and adding the pairwise
interactions between corresponding pairs. For a given
TCR-pMHC pair, this gives

$$E_{\rm int}(E_c, \vec{t}, \vec{s}) = E_c + \sum_{i=1}^{N} J(t_i, s_i), \qquad (1)$$

where $J(t_i, s_i)$ is the contribution from the $i$th amino
acid of the TCR ($t_i$) and the peptide ($s_i$), and $N \sim 5$ is
the length of the variable TCR-peptide region. The matrix
$J$ encodes the interaction energies between specific
pairs of amino acids. For numerical purposes we use the
Miyazawa-Jernigan matrix [20] that was developed in the
context of protein folding, but as will be described later
the qualitative results do not depend on the form of $J$.
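As a minimal illustration of Eq. (1), the sketch below evaluates the interaction free energy for one TCR-peptide pair. The contact energies are made-up placeholders drawn at random (the Miyazawa-Jernigan values are not reproduced here), and the function names, the example sequences, and the value of $E_c$ are hypothetical choices made only for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_placeholder_J(seed=0):
    """Symmetric 20x20 table of made-up contact energies (units of k_B T).

    This is a stand-in for the Miyazawa-Jernigan matrix, not its actual values.
    """
    rng = random.Random(seed)
    J = {}
    for i, a in enumerate(AMINO_ACIDS):
        for b in AMINO_ACIDS[i:]:
            J[(a, b)] = J[(b, a)] = -rng.uniform(0.0, 1.0)
    return J


def interaction_energy(E_c, tcr, peptide, J):
    """Eq. (1): E_int = E_c + sum_i J(t_i, s_i), with site i of the TCR
    interacting only with site i of the peptide."""
    if len(tcr) != len(peptide):
        raise ValueError("TCR and peptide strings must have equal length N")
    return E_c + sum(J[(t, s)] for t, s in zip(tcr, peptide))


J = make_placeholder_J()
# Hypothetical N = 5 sequences and a hypothetical TCR-MHC contribution E_c.
print(interaction_energy(-2.0, "LFAVG", "KWYTE", J))
```

Because each site contributes additively, a point mutation of the peptide changes the total by the difference of two entries of $J$ only.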

To model thymic selection, we start by randomly generating
a set of $M$ self peptides, where amino acids
are picked with frequencies corresponding to the human
proteome [16, 21] (using the mouse proteome does not
change the qualitative results). Then we randomly
generate TCR sequences with the same amino acid frequencies.
To mimic thymic selection, TCR sequences
that bind to any self-pMHC too strongly ($E_{\rm int} < E_n$;
stronger binding corresponds to a more negative free energy) are
deleted (negative selection). However, a TCR must also
bind sufficiently strongly ($E_{\rm int} < E_p$) to at least one
self-pMHC to receive survival signals and emerge from the
thymus (positive selection). Recent experiments show
that the difference between the thresholds for positive
and negative selection is relatively small (a few $k_B T$) [11].
The threshold for negative selection ($E_n$) is quite sharp,
while the threshold for positive selection ($E_p$) is soft. Replacing
soft thresholds with perfectly sharp thresholds at
$E_n$ and $E_p$ does not change the qualitative behavior of
the selected T cell repertoire (see below and Ref. [11]).
However, we do carry out calculations with soft thresh- 
olds as well to study the escape of potentially autoim- 
mune T cells, and their pathogen recognition character- 
istics. 

To completely specify the interaction free energy between
a TCR and pMHC, we need to specify the value of
$E_c$. In previous studies [16, 17] we fixed the value of $E_c$
for all TCRs at some moderate value, because too strong
binding to MHC (large $|E_c|$) would result in negative selection
with any peptide, and too weak a binding to MHC
(small $|E_c|$) would result in the TCR not being positively selected.
Each human can have up to 12 different MHC
types. A TCR that binds strongly to more than one
MHC type is likely to be eliminated during negative selection.
Therefore, we consider TCRs that are restricted
by a particular MHC type. We expect that variations
in $E_c$ for selected TCRs are small. A rough estimate of
the bounds can be obtained from the condition that the
average interaction free energy between TCR and pMHC
for selected TCRs is between the thresholds for positive
and negative selection,

$$E_n < E_c + N\bar{J} < E_p, \qquad (2)$$

where $\bar{J}$ is the average value of the interaction between amino
acids. The upper (lower) bound $E_{c,\max} = E_p - N\bar{J}$
($E_{c,\min} = E_n - N\bar{J}$) ensures that, on average, interactions
result in positive selection and not negative selection.
Since for selection it is enough that a TCR sequence
is positively selected by one of many self peptides
and avoids being negatively selected by all encountered self peptides,
the actual bounds for $E_c$ might be different, but
we expect that the range of $E_c$ values is still small; viz.,
$E_{c,\max} - E_{c,\min} \propto E_p - E_n$. To every TCR sequence we
assign a random value of $E_c$ chosen uniformly from the
interval $(E_{c,\min}, E_{c,\max})$, and subject it to the selection
processes. Note that TCRs with interactions with MHCs
that are too weak are unlikely to be oriented on MHCs
properly and hence will be unable to interact with the
peptide. Thus, one cannot tune $E_c$ to very low values to
escape negative selection.

FIG. 1: Effects of thymic selection on the characteristics of selected TCRs. a) Schematic representation of the interface between
TCR and pMHCs. The region of the TCR contacting the peptide is highly variable and is modeled by strings of amino acids of
length $N$. The peptide is also treated similarly. The binding free energy between the TCR and the entire pMHC is computed
as described in the text. b) Amino acid composition of TCRs selected against $M$ types of self peptides in the thymus. The
ordinate is the ratio of the frequency of occurrence of an amino acid in the peptide contact residues of selected TCRs and the
pre-selection frequency. TCRs selected against many types of self peptides in the thymus have peptide contact residues that
are enriched in amino acids that interact weakly with other amino acids. Amino acids on the abscissa are ordered according
to their largest interaction strength with other amino acids in the potential matrix, $J$. c) Probability density distribution
of $E_c$ values (strength of TCR binding to MHC) of TCRs selected against $M$ types of self peptides. TCRs selected against
many types of self peptides are more likely to bind weakly to MHC. The parameter values are: $N = 5$, $E_p - E_n = 2.5 k_B T$,
$E_{c,\max} = E_p - N\bar{J}$, $E_{c,\min} = E_n - N\bar{J}$. We have used the Miyazawa-Jernigan matrix, $J$ [20], and amino acid frequencies $f_a$
from the human proteome [16, 21] (using the mouse proteome does not change the qualitative results).
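The selection rules of this section can be sketched as a short simulation with sharp thresholds. This is an illustrative sketch only: the contact energies are random placeholders rather than the Miyazawa-Jernigan matrix, amino acids are drawn with uniform rather than proteome frequencies, and the value of $E_n$ is hypothetical (only the gap $E_p - E_n = 2.5 k_B T$ is taken from the text). All function names are introduced here for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def placeholder_J(rng):
    """Made-up symmetric contact energies in units of k_B T (not Miyazawa-Jernigan)."""
    J = {}
    for i, a in enumerate(AMINO_ACIDS):
        for b in AMINO_ACIDS[i:]:
            J[(a, b)] = J[(b, a)] = -rng.uniform(0.0, 1.0)
    return J


def random_sequence(N, rng):
    # Assumption: uniform amino-acid frequencies instead of proteome frequencies.
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(N))


def strongest_energy(E_c, tcr, peptides, J):
    """Most negative interaction free energy (Eq. 1) of one TCR over a list of peptides."""
    return min(E_c + sum(J[(t, s)] for t, s in zip(tcr, pep)) for pep in peptides)


def thymic_selection(n_tcr, M, N, E_n, E_p, J, rng):
    """Return (selected TCRs as (E_c, sequence) pairs, list of self peptides)."""
    J_bar = sum(J.values()) / len(J)                      # average pairwise interaction
    E_c_min, E_c_max = E_n - N * J_bar, E_p - N * J_bar   # bounds implied by Eq. (2)
    self_peptides = [random_sequence(N, rng) for _ in range(M)]
    selected = []
    for _ in range(n_tcr):
        tcr = random_sequence(N, rng)
        E_c = rng.uniform(E_c_min, E_c_max)
        E_min = strongest_energy(E_c, tcr, self_peptides, J)
        # Survive negative selection (E_min > E_n) and pass positive selection (E_min < E_p).
        if E_n < E_min < E_p:
            selected.append((E_c, tcr))
    return selected, self_peptides


rng = random.Random(1)
J = placeholder_J(rng)
E_n = -6.0           # hypothetical threshold value
E_p = E_n + 2.5      # gap of 2.5 k_B T as quoted in the text
selected, self_peptides = thymic_selection(2000, 100, 5, E_n, E_p, J, rng)
print(len(selected), "of 2000 TCRs selected")
```

Only the strongest interaction of each TCR with the self peptides matters for its fate, which is why selection can be treated as an extreme value problem.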



RESULTS 

TCRs selected against many self peptides are 
enriched with weakly interacting amino acids and 
bind more weakly to MHCs 

First, we study how thymic selection shapes TCR sequences
and TCR interactions with MHC. A million
TCR peptide contact sequences with randomly assigned
$E_c$ values are randomly generated and selected
against $M$ randomly generated self peptides according to
the thymic selection rules described in the previous section.
For the set of selected TCRs, we assess their amino
acid composition and their interactions with MHC ($E_c$
values). The whole process is repeated a thousand times
to obtain proper statistics. The peptide contact residues
of TCR sequences selected against many self peptides
are statistically enriched with weakly interacting amino
acids (Fig. 1b), and TCRs with weaker binding to MHC
(within the allowed range) are more likely to get selected
(Fig. 1c). This is because negative selection imposes a
strong constraint. When selected against many self peptides,
TCR sequences with peptide contact residues containing
strongly interacting amino acids (e.g., hydrophobic
amino acids or those with flexible side chains) or
TCRs that bind strongly to MHC are more likely to bind
strongly with at least one self-pMHC and thus be negatively
selected. This qualitative result agrees with experiment
and is independent of details of the interaction
potential $J$ or the sharpness of the thresholds for
positive and negative selection, as will be shown next.

The selection of a given TCR sequence $\vec{t}$ is determined
by the strongest interaction with all self peptides, and a
TCR is selected when

$$E_n < \min_{\vec{s}} \{ E_{\rm int}(E_c, \vec{t}, \vec{s}) \} < E_p, \qquad (3)$$

where the minimum is taken over the $M$ self peptides $\vec{s}$.
In Ref. [17] we showed that by using the Extreme Value
Distribution one finds that the strongest interaction energy
with $M$ random self peptides is sharply peaked
around

$$E_0(E_c, \vec{t}) = E_c + \sum_{i=1}^{N} \mathcal{E}(t_i) - \sqrt{2 \ln M \sum_{i=1}^{N} \mathcal{V}(t_i)}, \qquad (4)$$

where $\mathcal{E}(t_i) = [J(t_i, a)]_a$ and $\mathcal{V}(t_i) = [J(t_i, a)^2]_a - [J(t_i, a)]_a^2$
are the average and the variance of the interaction
free energy of amino acid $t_i$ with all others. We
have denoted the average over self amino acid frequencies
by $[G(a)]_a \equiv \sum_{a=1}^{20} f_a G(a)$. From this equation and the
selection condition (3) we see that, as the number of self
peptides, $M$, increases, the chance of negative selection
does too. To counterbalance this pressure for large $M$,
the selected repertoire is enriched in TCRs with weakly interacting amino acids
in their peptide contact residues and in TCRs that interact
weakly with MHC (small $|E_c|$) (see Fig. 1). A similar
effect can be obtained with amino acids with smaller
variance of interactions, but this effect is less pronounced
because of the square root. Even if it were in effect, it
would pick out TCRs with smaller variance, which, when
the means are the same, also implies selecting
the more weakly binding amino acids. These results are
independent of the form of the statistical potential between
contacting amino acids. Different potentials only
change the identities of weak and strong amino acids.

The probabilities with which amino acids are chosen
for the selected TCRs in the T cell repertoire depend
on the conditions (e.g. the number of peptides present
in the thymus). This dependency can be formalized by
using statistical mechanical methods that apply in the
limit of very long peptides, and remarkably the results
seem to be accurate even for short peptides. The
thymic selection condition

$$E_n < E_0(E_c, \vec{t}) < E_p \qquad (5)$$

can be interpreted as a micro-canonical ensemble of sequences
$\vec{t}$, which are acceptable if the value of the Hamiltonian,
$E_0(E_c, \vec{t})$, falls in the interval $(E_n, E_p)$. In the
limit of long peptides canonical and micro-canonical ensembles
are equivalent. Thus the probability for TCR
selection is governed by the Boltzmann weight

$$p(E_c, \vec{t}) \propto \left( \prod_{i=1}^{N} f_{t_i} \right) \rho(E_c) \exp\left[ -\beta E_0(E_c, \vec{t}) \right]. \qquad (6)$$

Here $f_a$ and $\rho(E_c)$ are the natural frequencies of the different
amino acids and the distribution of $E_c$ values prior
to selection, whereas the effect of thymic selection is captured
in the parameter $\beta$, which is determined by the
condition that the average free energy falls in the interval
$(E_n, E_p)$. The complication, presented by the square
root term in Eq. (4), for determining parameter $\beta$ is easily
dealt with by Hamiltonian minimization [22] and introducing
an effective Hamiltonian,

$$H_0(E_c, \vec{t}) = E_c + \sum_{i=1}^{N} \left[ \mathcal{E}(t_i) - \gamma \mathcal{V}(t_i) \right] - \frac{\ln M}{2\gamma}. \qquad (7)$$

This corresponds to Boltzmann weights

$$p(E_c, \vec{t}) \propto \rho(E_c) \exp[-\beta E_c] \times \prod_{i=1}^{N} \left\{ f_{t_i} \exp\left[ -\beta \left( \mathcal{E}(t_i) - \gamma \mathcal{V}(t_i) \right) \right] \right\}, \qquad (8)$$

for which thermodynamic quantities are easily computed.
$\gamma(\beta) = \sqrt{\ln M / (2N\langle \mathcal{V} \rangle)}$ is determined by minimizing
the effective Hamiltonian $H_0(E_c, \vec{t})$ with respect to $\gamma$,
which ensures that the average free energies $\langle E_0(E_c, \vec{t}) \rangle$
and $\langle H_0(E_c, \vec{t}) \rangle$ are the same; the averaging is done over
all TCR sequences $\vec{t}$ weighted with Boltzmann weights
(6) and (8).
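The role of $\gamma$ can be checked numerically. For a fixed total variance $V = \sum_i \mathcal{V}(t_i)$, the square-root term of Eq. (4), $\sqrt{2 V \ln M}$, equals the smallest value over $\gamma$ of the linearized expression $\gamma V + \ln M/(2\gamma)$ appearing in Eq. (7), attained at $\gamma = \sqrt{\ln M/(2V)}$. The sketch below verifies this identity for one illustrative, made-up value of $V$; the function names are hypothetical, and in the text the same condition is applied with the average variance $N\langle \mathcal{V} \rangle$.

```python
import math


def evd_shift(total_variance, M):
    """Square-root term of Eq. (4): sqrt(2 ln M * sum_i V(t_i))."""
    return math.sqrt(2.0 * math.log(M) * total_variance)


def linearized_shift(gamma, total_variance, M):
    """Linearized counterpart from Eq. (7): gamma * V + ln M / (2 gamma)."""
    return gamma * total_variance + math.log(M) / (2.0 * gamma)


def optimal_gamma(total_variance, M):
    """Stationary point of the linearized shift with respect to gamma."""
    return math.sqrt(math.log(M) / (2.0 * total_variance))


V_total, M = 2.0, 1000          # illustrative values, not taken from the paper
g_star = optimal_gamma(V_total, M)
print(evd_shift(V_total, M), linearized_shift(g_star, V_total, M))
```

Away from the stationary point the linearized expression is strictly larger than the square-root term, so fixing $\gamma$ by this condition reproduces Eq. (4) while making the Hamiltonian a sum of independent single-site terms.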

$\beta$ is determined by constraining the average free energy
to the range $(E_n, E_p)$, while maximizing entropy. Given
the bounded set of free energies, the parameter $\beta$ can
be either negative or positive. The values of $E_0(E_c, \vec{t})$
span a range from $E_{\min}$ to $E_{\max}$, and the corresponding
number of states $\Omega(E_0)$ is bell-shaped between these extremes
with a maximum at some $E_{\rm mid}$. If $E_{\rm mid} > E_p$,
we must set $\beta$ such that $\langle E_0(E_c, \vec{t}) \rangle = E_p$. In this case,
$\beta > 0$, positive selection is dominant, and TCRs
with stronger amino acids in their peptide contact residues
and TCRs that interact strongly with MHCs are selected. If
$E_{\rm mid} < E_n$, we must set $\beta$ such that $\langle E_0(E_c, \vec{t}) \rangle = E_n$;
now $\beta < 0$, negative selection is dominant, and TCRs
with peptide contact residues with weaker amino acids
and TCRs that interact weakly with MHCs are selected.
For $E_n < E_{\rm mid} < E_p$, we must set $\beta = 0$ and there
is no modification due to selection. The resulting phase
diagram of parameter $\beta$ is shown in Figure 2a.

For the relevant parameters in mouse (i.e. $N = 5$,
$E_p - E_n = 2.5 k_B T$, $E_{c,\max} = E_p - N\bar{J}$, $E_{c,\min} = E_n - N\bar{J}$,
and $M = 10^3$), we find $\beta = -3.06 (k_B T)^{-1}$ and
$\gamma = 0.94 (k_B T)^{-1}$; negative selection is dominant and
weaker amino acids are selected. This result is consistent
with experiments [16]. With these parameters we can
calculate the amino acid frequencies of selected TCRs as

$$f_a^{(\rm sel)} = \frac{f_a \exp\left[ -\beta \left( \mathcal{E}(a) - \gamma \mathcal{V}(a) \right) \right]}{\sum_{b=1}^{20} f_b \exp\left[ -\beta \left( \mathcal{E}(b) - \gamma \mathcal{V}(b) \right) \right]}, \qquad (9)$$

and the distribution of selected TCR interactions with
MHCs as

$$\rho^{(\rm sel)}(E_c) = \frac{\rho(E_c) \exp[-\beta E_c]}{\int_{E_{c,\min}}^{E_{c,\max}} \rho(E_c') \exp[-\beta E_c'] \, dE_c'}, \qquad (10)$$

where $\rho(E_c)$ is the distribution of TCR interactions with
MHCs before selection, which is taken to be a uniform
distribution in our simulations. We find (Figs. 2b and 2c)
that the analytical results above agree very well with
the numerical results of simulations for $N = 5$ and the
parameters presented above.

FIG. 2: Analytical results for characteristics of selected TCRs. (a) Representation of the dependency of the parameter $\beta$ (a
measure of amino acid composition of selected TCRs and a measure of selected TCR binding strengths to MHCs, see text) on
the number of self peptides ($\ln M / N$) and the threshold for negative selection $E_n$ with $(E_p - E_n)/N = (E_{c,\max} - E_{c,\min})/N =
0.5 k_B T$. The region between the black lines corresponds to $\beta = 0$, to the right (left) of which negative (positive) selection is
dominant, and weak (strong) amino acids are selected. The blue dashed lines indicate the relevant parameter values for thymic
selection in mouse. (b) Amino acid composition of selected TCR sequences, ordered in increasing frequency along the abscissa.
(c) Probability density distribution of $E_c$ values (strength of TCR binding to MHC) of selected TCRs. (b-c) The data points
in black are obtained numerically with the parameters relevant to mouse (see text). The error bars reflect the sample size
used to generate the histograms and differences for different realizations of $M$ self peptides. The red lines are the result of the
EVD analysis in the large $N$ limit (from Eqs. (9) and (10)), and the agreement is quite good. In both cases we have used the
Miyazawa-Jernigan matrix, $J$ [20], and amino acid frequencies $f_a$ from the human proteome [16, 21].
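The reweighting of amino acid frequencies in Eq. (9) can be illustrated with a toy three-letter alphabet. The values of $\beta$ and $\gamma$ below are the mouse values quoted in the text; the pre-selection frequencies, mean energies and variances are made-up placeholders, and the function name is hypothetical.

```python
import math


def selected_frequencies(f, mean_E, var_V, beta, gamma):
    """Eq. (9): reweight pre-selection frequencies f[a] by
    exp[-beta * (E(a) - gamma * V(a))] and normalize."""
    weights = {a: f[a] * math.exp(-beta * (mean_E[a] - gamma * var_V[a])) for a in f}
    Z = sum(weights.values())
    return {a: w / Z for a, w in weights.items()}


# Toy alphabet with illustrative values in units of k_B T.
f = {"strong": 1 / 3, "medium": 1 / 3, "weak": 1 / 3}
mean_E = {"strong": -3.0, "medium": -2.0, "weak": -1.0}   # average interaction energies
var_V = {"strong": 0.5, "medium": 0.5, "weak": 0.5}       # equal variances for simplicity

beta, gamma = -3.06, 0.94    # values quoted in the text for mouse parameters
f_sel = selected_frequencies(f, mean_E, var_V, beta, gamma)
print(f_sel)
```

With $\beta < 0$ the weight grows as the mean interaction energy becomes less negative, so the weakly interacting letter is enriched, which is the trend described for selection against many self peptides.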



Selection against many self peptides is required for 
pathogen-specific T cells 

How does such a T cell repertoire lead to specific
recognition of pathogenic peptides? To study the specificity
of mature T cells for pathogenic peptide recognition,
we challenge selected TCR sequences with a collection
of many randomly generated pathogenic peptides
where amino acid frequencies correspond to L. monocytogenes
[16, 23], a pathogen that infects humans. TCR
recognition of a pathogenic peptide occurs if TCR-pMHC
binding is sufficiently strong ($E_{\rm int} < E_r$), where the
recognition threshold in mouse experiments is such that
$E_r \sim E_n$ [24]. For each TCR that recognizes a particular
pathogenic sequence, the specificity of recognition
was tested. Each site on the peptide was mutated to
all other 19 possibilities, and recognition by the reactive
TCR was again assessed. If more than half the mutations
at a particular site abrogated recognition by the
same TCR, this site was labeled an important contact.
For each strongly bound TCR-pMHC pair, the number
of important contacts was determined. After summing
over all selected TCRs and pathogenic peptides, we obtained
a histogram of the number of important contacts
(Fig. 3a). The higher the number of important contacts,
the more specific is the TCR recognition of the pathogenic
peptide. Small numbers of important contacts correspond
to cross-reactive TCRs that are able to recognize
many pathogenic peptide mutants. The obtained result
is qualitatively the same as the one obtained in previous
studies [16, 17], where the binding free energy of TCRs
with MHC was fixed.
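The mutational scan described above can be written compactly. The sketch below counts important contacts for one TCR-peptide pair; the three-letter alphabet and its energies are made-up so that the result can be followed by hand, and the function name is hypothetical.

```python
def count_important_contacts(E_c, tcr, peptide, J, E_r, alphabet):
    """Count peptide sites at which more than half of the single-site
    substitutions abrogate recognition (E_int >= E_r)."""

    def energy(pep):
        return E_c + sum(J[(t, s)] for t, s in zip(tcr, pep))

    if energy(peptide) >= E_r:
        raise ValueError("the TCR does not recognize the unmutated peptide")

    n_mutants = len(alphabet) - 1
    important = 0
    for i, original in enumerate(peptide):
        abrogating = 0
        for a in alphabet:
            if a == original:
                continue
            mutant = peptide[:i] + a + peptide[i + 1:]
            if energy(mutant) >= E_r:        # recognition is lost
                abrogating += 1
        if 2 * abrogating > n_mutants:       # more than half of the substitutions
            important += 1
    return important


# Toy alphabet with made-up energies: only peptide residue "A" contributes to binding.
alphabet = "ABC"
J = {(t, s): (-2.0 if s == "A" else 0.0) for t in alphabet for s in alphabet}

# Binding barely exceeds the threshold: every site is an important contact.
print(count_important_contacts(0.0, "AAA", "AAA", J, E_r=-5.0, alphabet=alphabet))
# Binding far exceeds the threshold: no single mutation abrogates recognition.
print(count_important_contacts(0.0, "AAA", "AAA", J, E_r=-3.0, alphabet=alphabet))
```

The two calls illustrate the two regimes discussed in this section: recognition through several contacts that are each indispensable, and recognition that is strong enough to tolerate any point mutation.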

In agreement with experiments, we find in our
model that TCRs selected against many different self
peptides are very specific, while TCRs selected against
only one self peptide are more cross-reactive (Fig. 3a).
Based on the amino acid composition of selected TCRs
(see previous section), we can provide a mechanistic
explanation for the specificity/degeneracy of pathogen
recognition. Because TCR peptide contact residues are
enriched with weakly interacting amino acids and TCRs
are more likely to bind only moderately to MHC, they can
interact sufficiently strongly for recognition to occur only
with pathogenic peptides that are statistically enriched in
amino acids that are the stronger binding complements
of the peptide contact residues of the TCR (Fig. 3b).

FIG. 3: Characteristics of TCR recognition of pathogenic peptides. a) Histogram of the number of important contacts (defined
in text) with which T cells recognize pathogenic peptides. T cells selected against many self peptides recognize pathogenic
peptides via many important contacts and are thus specific. In contrast, T cells selected against few types of self peptides
recognize pathogenic peptides with only a few important contacts and are thus cross-reactive. b) Amino acid composition of
pathogenic peptides that are recognized by at least one of the selected TCRs. TCRs selected against many types of self peptides
recognize pathogenic peptides that are enriched with strongly interacting amino acids. In contrast, TCRs selected against few
types of self peptides may also recognize pathogenic peptides that contain weakly or moderately interacting amino acids. We
used the same set of parameters as in Figure 1 to obtain these results.



Such TCR-peptide interactions generate weak to moderate
interactions which sum up to provide sufficient binding
ing strength for recognition; each interaction contributes 
a significant percentage of the total binding affinity. If 
there is a mutation to a peptide amino acid of a rec- 
ognized peptide, it is likely to weaken the interaction 
it participates in as recognized peptides are statistically 
enriched in amino acids that interact strongly with the 
TCR's peptide contact residues. Weakening an interac- 
tion that contributes a significant percentage of the bind- 
ing free energy is likely to abrogate recognition because 
the recognition threshold is sharply defined.

This statistical view of TCR specificity for antigen may 
describe the initial step of binding, which may then allow
modest conformational adjustments. This mechanism
also suggests an explanation for why TCR recog-
nition of pathogenic peptides can be degenerate. There 
are many combinatorial ways of distributing strongly in- 
teracting amino acids along the peptide, which lead to 
sufficiently strong binding with TCR for recognition. 

In agreement with experiments, for T cells that develop
in mice with many peptides in the thymus, sufficiently
strong binding for recognition is achieved via
many moderate bonds, and each of these bonds is important
for recognition. In contrast, TCR sequences selected
against only one type of self peptide have a higher chance
of containing strongly interacting amino acids and have a
higher chance to bind more strongly to the MHC (Fig. 1).
Such TCRs can recognize many more pathogenic peptides,
including the ones that contain weakly or moderately
interacting amino acids (Fig. 3b). In many cases mutating
such amino acids on the peptide does not prevent
recognition by the same TCR because a small number of
strong contacts dominate recognition (Fig. 3a and experiments),
and unless these specific ones are disrupted by
mutations to the peptide, recognition is not abrogated.
Accordingly, TCR recognition of pathogenic peptides is
more cross-reactive.

It may also happen that the binding interaction between
a TCR and pathogenic peptide-MHC is sufficiently
strong that a single mutation of peptide amino acids
cannot prevent recognition, which results in zero important
contacts. This may happen because of the stronger
binding of TCRs to MHC and because of the higher
chance of TCRs having strongly interacting peptide contact
residues. When selected against fewer types of self
peptides, TCRs that bind strongly to MHC can escape
(Fig. 1c). Thus in this case the escape of TCRs that bind
strongly or moderately to more than one MHC type (or
MHC with mutations) might also be possible, leading to
more cross-reactivity to MHC types (or substitutions of
MHC amino acids).



Characteristics of foreign peptides recognized by T cells

Once T cells complete thymic selection, a set $\mathcal{T}$ of
TCRs ($K$ in number) is released in the blood stream,
where they try to identify infected cells. A T cell recognizes
infected cells when its TCR binds sufficiently
strongly ($E_{\rm int} < E_r$) to foreign peptide-MHC. Experimental
evidence [24] suggests that the negative selection
threshold in the thymus is the same as the recognition
threshold in the periphery, i.e. $E_r \sim E_n$. This means
that a foreign peptide of sequence $\vec{s}$ is recognized if its
strongest interaction with the set of TCRs exceeds the
threshold for recognition, i.e.

$$\min_{\{E_c, \vec{t}\} \in \mathcal{T}} \{ E_{\rm int}(E_c, \vec{t}, \vec{s}) \} < E_n. \qquad (11)$$

Eq. (11) casts the recognition of foreign peptides as
another extreme value problem. To calculate the probability
$P_{\rm rec}(\vec{s})$ that a foreign peptide sequence $\vec{s}$ is recognized
by T cells and to calculate the amino acid composition
of recognized foreign peptides in the limit of long
peptide sequences, we use the same procedure that was
used in Ref. [17] (briefly discussed in the previous section)
to calculate the properties of selected TCRs. Therefore,
here we just briefly summarize the necessary steps.

Let us indicate by p* (x\s) the probability density func- 
tion (PDF) of the interaction free energy between the for- 
eign peptide s and a random TCR that is selected in the 
thymus. Then the probability that foreign peptide s is 
recognized is obtained by integrating the extreme value 
distribution (EVD) II* (x\s) over the allowed range: 



Pec (s ) 



Ii*(x\s) dx, with 



U*(x\s) = Kp* (x\s ) [1 - P* (E < x\s] 



iK-l 



(12) 



where P* is the cumulative probability of the PDF 
p*. If we model the set T of selected T cells as K 
strings in which each amino acid is chosen independently 
with frequencies (i.e. we ignore the correlations 

among different positions on the string), then in the 
limit of long peptide sequences (large N) we can ap- 
proximate the PDF p* (x\s) with a Gaussian with mean 
E* v (s) = (E c ) + E?=i£*(^) and variance V* (s) = 

( E c)c + E*Li.V*(si). The mean £*(s{) and the vari- 
ance V*(sj) of the amino acid interaction free energies 
are obtained as in the previous section by appropriately 
replacing f a with fa Sel \ The mean (E c ) and the variance 
(E 2 C ) C = {E 2 C ) - (E c ) 2 of selected TCR interactions with 



In 



- Xp(E c )exp[-/3E c ]dE c 



MHCs are defined as (X) = 

/E c C '™7p(iSc)oxp[-/3iS c ]<i£; c 

where p (E c ) is distribution of E c values before selec- 
tion. In the limit of large number of T cells (K ^> 1) 
the extreme value distribution II* (x\s) is sharply peaked 
around 

E* (s) = (Pc)+EL 



2 In A) (E 2 ) 



■e£iV(-,)1, (13) 



and in the large $N$ limit the condition for recognition of foreign peptides becomes

$$E_0^*(\vec{s}) < E_n. \qquad (14)$$

The probability for a sequence $\vec{s}$ to be recognized is governed by the Boltzmann weight $p(\vec{s}) \propto \left(\prod_{i=1}^{N} \tilde{f}_{s_i}\right) \exp\left[-\beta^* E_0^*(\vec{s})\right]$, where $\{\tilde{f}_a\}$ are the natural frequencies of different amino acids in the pathogen proteome, while the effect of TCR recognition is captured in the parameter $\beta^*$. As in the previous section, we introduce a new Hamiltonian $H_0^*(\vec{s}) = \langle E_c \rangle - \gamma^* \langle E_c^2 \rangle_c + \sum_{i=1}^{N} \left[\mathcal{E}^*(s_i) - \gamma^* \mathcal{V}^*(s_i)\right] - \ln K / (2\gamma^*)$, and to ensure the same average energies $\langle E_0^*(\vec{s}) \rangle$ and $\langle H_0^*(\vec{s}) \rangle$ we set $\gamma^*(\beta^*) = \sqrt{\ln K / \left(2 \langle E_c^2 \rangle_c + 2N \langle \mathcal{V}^* \rangle\right)}$. Finally, the value of $\beta^*$ is determined by constraining the average energy $\langle E_0^*(\vec{s}) \rangle < E_n$, while maximizing entropy. If $E^*_{mid} > E_n$, we must set $\beta^*$ such that $\langle E_0^*(\vec{s}) \rangle = E_n$, where $E^*_{mid}$ is defined as in the previous section. In this case, $\beta^* > 0$, and only foreign peptides with stronger amino acids are recognized. If $E^*_{mid} < E_n$, we must set $\beta^* = 0$, and there is no modification due to recognition, i.e. every foreign peptide is recognized. Note that unlike for the thymic selection of T cell receptors (the parameter $\beta$), the parameter $\beta^*$ cannot be negative, because there is no lower energy bound for recognition in Eq. (14). With all parameters determined, we can calculate the amino acid frequencies of recognized foreign peptides

$$\tilde{f}_a^{(rec)} = \frac{\tilde{f}_a \exp\left[-\beta^* \left(\mathcal{E}^*(a) - \gamma^* \mathcal{V}^*(a)\right)\right]}{\sum_{b=1}^{20} \tilde{f}_b \exp\left[-\beta^* \left(\mathcal{E}^*(b) - \gamma^* \mathcal{V}^*(b)\right)\right]}. \qquad (15)$$
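As a minimal sketch of how the reweighting in Eq. (15) acts on the pathogen amino acid frequencies, the Python snippet below uses a hypothetical three-letter alphabet. The frequencies `f`, the per-residue mean energies `eps` and the variances `var` are illustrative placeholders (they are not derived from the Miyazawa-Jernigan matrix), and the values passed for `beta` and `gamma` are simply chosen for illustration, in units of $(k_BT)^{-1}$.

```python
import math

def recognized_frequencies(f, eps, var, beta, gamma):
    """Eq. (15): Boltzmann-reweight pathogen amino acid frequencies.

    f, eps and var are dicts keyed by amino acid: natural frequency,
    mean interaction free energy and its variance (hypothetical values).
    """
    weights = {a: f[a] * math.exp(-beta * (eps[a] - gamma * var[a])) for a in f}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

# Toy alphabet; all numbers are placeholders in units of k_B T.
f = {"L": 0.3, "A": 0.4, "D": 0.3}
eps = {"L": -4.0, "A": -3.0, "D": -2.0}  # more negative = stronger binding
var = {"L": 1.0, "A": 1.0, "D": 1.0}

f_rec = recognized_frequencies(f, eps, var, beta=0.49, gamma=1.80)
f_all = recognized_frequencies(f, eps, var, beta=0.0, gamma=1.80)
```

For any positive `beta` the strongly binding letter is enriched and the weakly binding one depleted, while `beta = 0` returns the natural frequencies unchanged, which is the "every foreign peptide is recognized" case.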



Figure 4 depicts the variation of $\beta^*$ as a function of the number of selected TCRs ($\ln(K)/N$), the number of self peptides ($\ln(M)/N$) against which TCRs were selected, and the threshold for negative selection $E_n/N$ with $(E_p - E_n)/N = 0.5\,k_BT$. The region $\beta^* = 0$ is possible only for $K > M$ and for parameters where $|\beta|$ is small (cf. Fig. [5^). That is, all foreign peptides are recognized only when there are very many TCRs or when TCRs are selected against a small number $M$ of self peptides in the thymus; neither condition is biologically realistic.
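The dependence of recognition on the number of TCRs can be illustrated directly from Eq. (12): integrating the EVD up to the threshold gives $P_{rec} = 1 - [1 - P^*(E_n)]^K$. The sketch below assumes a Gaussian $\rho^*$ and takes $E_n$ as the upper integration limit; the mean, variance and threshold are placeholder numbers, and the function names are hypothetical.

```python
import math

def gaussian_cdf(x, mean, var):
    """Cumulative probability P*(E < x) of a Gaussian PDF."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def recognition_probability(e_n, mean, var, k):
    """Closed form of Eq. (12): 1 - [1 - P*(E_n)]^K for K independent TCRs."""
    return 1.0 - (1.0 - gaussian_cdf(e_n, mean, var)) ** k

def recognition_probability_numeric(e_n, mean, var, k, steps=20000):
    """Midpoint-rule integral of the EVD K*rho*(1 - P)^(K-1) up to E_n."""
    lo = mean - 12.0 * math.sqrt(var)  # effectively minus infinity
    h = (e_n - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        rho = math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        total += k * rho * (1.0 - gaussian_cdf(x, mean, var)) ** (k - 1) * h
    return total

# Placeholder setting: threshold four standard deviations below the mean.
p_by_k = {k: recognition_probability(-4.0, 0.0, 1.0, k) for k in (1, 10**3, 10**6)}
```

In this toy setting a single TCR almost never recognizes the peptide, whereas a very large pool recognizes it with near certainty, in line with the statement that $\beta^* = 0$ requires very many TCRs.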

We also compared the analytical results, which are exact in the limit $N \to \infty$, with numerical simulations for $N = 5$ (Fig. 5). From the set of parameters that are relevant for thymic selection in the mouse (i.e. $N = 5$, $E_p - E_n = 2.5\,k_BT$, $E_{c,\max} = E_p - NJ$, $E_{c,\min} = E_n - NJ$, and $M = 10^3$), we generated a pool of $K = 10^3$ selected TCRs. Then we randomly generated $10^6$ foreign peptides with amino acid frequencies $\tilde{f}_a$ which were representative of L. monocytogenes (Ref. [23]), and checked the amino acid composition of foreign peptides that were recognized by at least one TCR. We find that selected TCRs recognize only foreign peptides that are enriched with strongly interacting amino acids (Fig. 5). Increasing experimental evidence indicates that this may be true [19]. The physical reason for this was discussed in the previous section. The quantitative agreement between simulations (black line) and the analytical result (red line, $\beta^* = 0.49\,(k_BT)^{-1}$ and $\gamma^* = 1.80\,(k_BT)^{-1}$) is not very good (Fig. 5).
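The numerical experiment described above can be sketched in a few lines. This is a toy version under loudly stated assumptions: a hypothetical four-letter alphabet and a made-up additive interaction matrix `J` stand in for the 20 amino acids and the Miyazawa-Jernigan matrix, all sequences are drawn with uniform letter frequencies, $E_c$ is drawn uniformly from a placeholder interval, and the pool sizes are much smaller than those used in the text.

```python
import random
from collections import Counter

ALPHABET = "LVSD"  # hypothetical 4-letter alphabet, ordered strong -> weak
# Made-up contact strengths (units of k_B T); not Miyazawa-Jernigan values.
STRENGTH = {"L": -3.0, "V": -2.5, "S": -1.5, "D": -1.0}
J = {(a, b): 0.5 * (STRENGTH[a] + STRENGTH[b]) for a in ALPHABET for b in ALPHABET}

def interaction(e_c, tcr, peptide):
    """E_int = E_c + sum_i J(t_i, s_i)."""
    return e_c + sum(J[t, s] for t, s in zip(tcr, peptide))

def random_sequence(rng, n):
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def select_tcrs(rng, self_peptides, n, e_n, e_p, e_c_range, attempts):
    """Keep TCRs whose strongest self interaction lies between E_n and E_p."""
    pool = []
    for _ in range(attempts):
        tcr, e_c = random_sequence(rng, n), rng.uniform(*e_c_range)
        strongest = min(interaction(e_c, tcr, s) for s in self_peptides)
        if e_n < strongest < e_p:
            pool.append((e_c, tcr))
    return pool

def recognized_composition(pool, peptides, e_n):
    """Amino acid counts of peptides bound below E_n by at least one TCR."""
    counts = Counter()
    for s in peptides:
        if any(interaction(e_c, t, s) < e_n for e_c, t in pool):
            counts.update(s)
    return counts

rng = random.Random(0)
N, E_N, E_P = 5, -15.0, -12.5  # placeholder thresholds with E_p - E_n = 2.5
self_peptides = [random_sequence(rng, N) for _ in range(50)]
pool = select_tcrs(rng, self_peptides, N, E_N, E_P, (-5.0, -2.5), attempts=1000)
foreign = [random_sequence(rng, N) for _ in range(500)]
composition = recognized_composition(pool, foreign, E_N)
```

Normalizing `composition` by its total gives the analogue of the black data points in Fig. 5 for this toy model; any enrichment seen here only illustrates the procedure and carries no quantitative weight.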
FIG. 4: Color representation of the dependence of the parameter $\beta^*$ on the number of selected TCRs ($\ln(K)/N$), the number of self peptides ($\ln(M)/N$) against which TCRs were selected, and the threshold for negative selection $E_n/N$. Parameters are: $(E_p - E_n)/N = 0.5\,k_BT$ and $E_{c,\min} - E_{c,\max} = E_n - E_p$, in the limit of large $N$. Solid black lines separate regions with $\beta^* > 0$ and $\beta^* = 0$. The regions below the black dashed line in (a) and between the black dashed lines in (b) correspond to $\beta = 0$ (every TCR is selected). In (a) $E_{c,\min} = E_n - NJ$ and in (b) $\ln(K)/N = 1.5$.

[Figure 5 plot. Legend entry: simulation - uncorrelated TCRs. Abscissa: amino acids, from strong (F, L, I, W, M, V, ...) to weak (..., D, G, S, N, Q, P).]

FIG. 5: Amino-acid composition of recognized foreign peptides. The amino acids are ordered in decreasing frequency along the abscissa. The data points in black are obtained numerically with the parameters relevant to TCR selection in the mouse and $K = 10^3$ TCRs, which were then challenged with L. monocytogenes peptides (see text). The blue data points are for similarly challenged $K = 10^3$ uncorrelated TCRs. The error bars reflect the sample size used to generate the histograms and differences for different realizations of $M$ self peptides (black) or $K$ uncorrelated TCRs (blue). The red line is the result of the EVD analysis in the large $N$ limit from Eq. (15), where TCR amino acid frequencies were obtained from Eq. (9). In all cases we have used the Miyazawa-Jernigan matrix, $J$ [2j}, amino acid frequencies $f_a$ from the human proteome (using the mouse proteome does not change the qualitative results [13]), and amino acid frequencies $\tilde{f}_a$ from the L. monocytogenes proteome [16, 21].



This suggests that the assumption in our analytical model that the pool $\mathcal{T}$ of selected TCRs is uncorrelated might not be good for short sequences. To test the effect of correlations, we generated a set of $K = 10^3$ uncorrelated TCRs with amino acid frequencies $f_a^{(sel)}$ obtained from Eq. (9) and with MHC binding strengths drawn from the distribution in Eq. (10), and then checked the amino acid composition of foreign peptides recognized by this set. Figure 5 shows better agreement between the analytical result (red line) and simulations (blue data points) with uncorrelated TCRs. However, the analytical results still differ significantly from simulations. The large discrepancy is likely due to the inaccurate approximation of the micro-canonical by the canonical ensemble of recognized foreign peptides for short peptides ($N = 5$), which only holds in the limit of long peptides ($N \to \infty$). The worse agreement between the numerical (blue line) and analytical results (red line) for the amino acid composition of recognized peptides (Fig. 5) compared to the results for the amino acid composition of selected TCRs (Fig. [5Jd) might be due to the lower magnitude of the numerically obtained parameters, $|\beta^*| = 0.49\,(k_BT)^{-1} < |\beta| = 3.06\,(k_BT)^{-1}$, and the exponential dependence of amino acid frequencies on the parameters $\beta$ and $\beta^*$ (Eqs. (9) and (15)). On the other hand, the large difference between the numerically obtained results in black and blue lines is only due to the correlations of the selected TCRs in the thymus.
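The uncorrelated control pool can be sketched as follows: every position of every TCR is drawn independently from one set of frequencies, so correlations between positions are absent by construction. The frequencies `freqs_sel` and the uniform $E_c$ sampler below are placeholders standing in for Eqs. (9) and (10), which are not reproduced here, and the four-letter alphabet is hypothetical.

```python
import random

def uncorrelated_pool(rng, k, n, freqs, draw_e_c):
    """Draw K TCRs with independent positions and independently drawn E_c."""
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    return [(draw_e_c(rng), "".join(rng.choices(letters, weights=weights, k=n)))
            for _ in range(k)]

# Placeholder frequencies standing in for f_a^(sel) of Eq. (9).
freqs_sel = {"L": 0.1, "V": 0.2, "S": 0.3, "D": 0.4}

# Placeholder for the E_c distribution of Eq. (10): uniform on an interval.
pool = uncorrelated_pool(random.Random(1), k=1000, n=5, freqs=freqs_sel,
                         draw_e_c=lambda rng: rng.uniform(-5.0, -2.5))
```

Challenging such a pool with foreign peptides, exactly as for the thymically selected pool, isolates the effect of sequence correlations, because the single-site statistics are identical in both cases.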

To examine this further, we have also tested the effect of correlations in the limit of a large number $K$ of selected T cells, since a mouse has $\sim 10^8$ distinct T cells and a human has $\sim 10^9$ distinct T cells. Both numbers are larger than the total number of possible sequences ($20^N$ for $N = 5$) in TCR peptide binding regions. This suggests that the sequence length $N$ should be larger ($N = 6$ or $N = 7$) or that TCRs differ also in regions that do not bind peptide, e.g. different TCR binding strengths to the same MHC type, or different sets of TCR pools that correspond to different MHC types (each human has up to 12 different MHC types). For values of $K$ that are of the order of the total number of sequences, the EVD $\Pi^*$ probes the tails of the distribution $\rho^*(x|\vec{s})$, where it is no longer Gaussian. Because the distribution $\rho^*(x|\vec{s})$ is bounded, in the large $K$ limit the EVD approaches a delta-function centered at the $K$-independent value corresponding to the optimal binding energy:

$$E_0^*(\vec{s}) = \min_{\{E_c, \vec{t}\} \in \mathcal{T}} \left[E_c + \sum_{i=1}^{N} J(t_i, s_i)\right]. \qquad (16)$$

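The counting argument and the minimum in Eq. (16) can be checked with a short sketch. The two-letter alphabet, the energies in `toy_j` and the two-member pool are hypothetical placeholders; only the alphabet size of 20 and the repertoire sizes are taken from the text.

```python
def sequence_space_size(n, alphabet_size=20):
    """Number of distinct length-n sequences over the amino acid alphabet."""
    return alphabet_size ** n

def optimal_binding_energy(pool, peptide, j):
    """Eq. (16): minimum over the pool of E_c + sum_i J(t_i, s_i)."""
    return min(e_c + sum(j[t, s] for t, s in zip(tcr, peptide))
               for e_c, tcr in pool)

# Repertoire sizes quoted in the text versus sequence-space sizes.
MOUSE_T_CELLS, HUMAN_T_CELLS = 10**8, 10**9
sizes = {n: sequence_space_size(n) for n in (5, 6, 7)}

# Toy example; keys are (TCR letter, peptide letter), energies are made up.
toy_j = {("A", "A"): -1.0, ("A", "B"): -2.0, ("B", "A"): -2.0, ("B", "B"): -0.5}
toy_pool = [(-1.0, "AA"), (-0.5, "BB")]
best = optimal_binding_energy(toy_pool, "AB", toy_j)
```

With 20 amino acids, $N = 5$ gives $3.2 \times 10^6$ sequences, well below both repertoire sizes, whereas $N = 7$ gives $1.28 \times 10^9$.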



In the limit $N \to \infty$, a pool of TCR sequences is uncorrelated and the optimal binding energy can be written

$$E_0^*(\vec{s}) = E_{c,\min} + \sum_{i=1}^{N} J_{\min}(s_i), \qquad (17)$$




where $J_{\min}(a) = \min_b J(b, a)$. The condition for recognition of foreign peptides is

$$E_0^*(\vec{s}) < E_n, \qquad (18)$$

and the probability for a sequence $\vec{s}$ to be recognized is governed by the Boltzmann weight $p(\vec{s}) \propto \left(\prod_{i=1}^{N} \tilde{f}_{s_i}\right) \exp\left[-\beta^* E_0^*(\vec{s})\right]$, where $\{\tilde{f}_a\}$ are the natural frequencies of different amino acids in the pathogen proteome, while the effect of TCR recognition is captured in the parameter $\beta^*$. The value of $\beta^*$ is determined by constraining the average energy $\langle E_0^*(\vec{s}) \rangle < E_n$, while maximizing entropy in the same manner as described before (see the paragraph after Eq. (14)). The condition for $\beta^* > 0$ can be simplified to:

$$\frac{E_n - E_{c,\min}}{N} = \langle J_{\min}(a) \rangle = \frac{\sum_{a=1}^{20} \tilde{f}_a J_{\min}(a) \exp\left[-\beta^* J_{\min}(a)\right]}{\sum_{a=1}^{20} \tilde{f}_a \exp\left[-\beta^* J_{\min}(a)\right]}. \qquad (19)$$

With the parameter $\beta^*$ determined, we can calculate the amino acid frequencies of recognized foreign peptides

$$\tilde{f}_a^{(rec)} = \frac{\tilde{f}_a \exp\left[-\beta^* J_{\min}(a)\right]}{\sum_{b=1}^{20} \tilde{f}_b \exp\left[-\beta^* J_{\min}(b)\right]}. \qquad (20)$$

[Figure 6 plot. Legend: simulation; EVD in large N limit; simulation - uncorrelated TCRs. Abscissa: amino acids, from strong (F, L, I, W, M, V, Y, C, R, H, K, T, A, ...) to weak (..., D, G, S, N, Q, P).]

FIG. 6: Amino-acid composition of foreign peptides recognized by a complete pool of selected TCRs. The amino acids are ordered in decreasing frequency along the abscissa. The data points in black are obtained numerically where a complete pool of selected TCRs (obtained from thymic selection of all $20^N$ TCR sequences against $M = 10^3$ self peptides) is challenged with L. monocytogenes peptides (see text). The error bars reflect the differences for different realizations of $M = 10^3$ self peptides. The red line is the result of the EVD analysis in the large $N$ limit from Eq. (19). The data points in blue are obtained numerically where a complete pool of uncorrelated TCRs, which is equal to the complete pool of $20^N$ sequences (see text), is challenged with L. monocytogenes peptides. In this case, there are no error bars, because we use the complete pools of TCRs and foreign peptides with appropriate weights. In all cases we have used the Miyazawa-Jernigan matrix, $J$ [STJ], amino acid frequencies $f_a$ from humans (using the mouse proteome does not change the qualitative results [13]), and amino acid frequencies $\tilde{f}_a$ from the L. monocytogenes [16], [21].

For the relevant parameters in mouse (i.e. $N = 5$, $E_p - E_n = 2.5\,k_BT$, $E_{c,\max} = E_p - NJ$, $E_{c,\min} = E_n - NJ$, and $M = 10^3$), we obtain $\beta^* = 0$, which means that every foreign peptide is recognized. We tested this numerically by checking the properties of foreign peptides recognized by a complete pool of selected TCRs. For each of the $20^N$ possible TCR sequences $\vec{t}$, we calculated the strongest interaction with $M = 10^3$ self peptides, $E_{\min}(\vec{t}) = \min_{\vec{s} \in \mathcal{M}} \left[\sum_{i=1}^{N} J(t_i, s_i)\right]$, and constructed an interval of $E_c$ values that could result in the TCR selection as $(E^*_{c,\min}, E^*_{c,\max}) = (E_n - E_{\min}(\vec{t}), E_p - E_{\min}(\vec{t}))$. The actually selected $E_c$ values for a given TCR sequence $\vec{t}$ are obtained by intersecting the $(E^*_{c,\min}, E^*_{c,\max})$ interval with the allowed interval $(E_{c,\min}, E_{c,\max})$ before selection. Thus we obtained a complete pool of selected TCRs with weights $\rho(E_c) \left(\prod_{i=1}^{N} f_{t_i}\right) \times \chi_{E_c \in (E_{c,\min}, E_{c,\max}) \cap (E^*_{c,\min}, E^*_{c,\max})}$, where $\chi$ is an indicator function with a value of 1 if a TCR with given $E_c$ and $\vec{t}$ is selected and 0 otherwise. The complete pool of selected TCRs was then challenged against $10^5$ randomly generated foreign peptides that were representative of L. monocytogenes and the whole process was repeated 1,000 times with different realizations of $M = 10^3$ self peptides. Fig. 6 shows a large disagreement between numerical simulations (black data points), where only foreign peptides with strongly interacting amino acids are recognized, and the analytical result (red line, Eq. (20)), where every foreign peptide is recognized. Because the selected TCR sequences are correlated, we also tested the effect of correlations. Since $f_a^{(sel)} > 0$ for every amino acid $a$ and $p^{(sel)}(E_c) > 0$ for all allowed $E_c$ values, the complete pool of uncorrelated TCRs corresponds to all $20^N$ sequences and all possible interaction values with MHCs, $E_c$, with weights $p^{(sel)}(E_c) \prod_{i=1}^{N} f^{(sel)}_{t_i}$. When a foreign peptide is tested against the complete pool of TCRs, it finds a TCR that results in the optimal binding energy (Eq. (17)). To numerically calculate the amino acid frequencies of foreign peptides recognized by the complete uncorrelated pool of TCRs, we generate all possible $20^N$ foreign peptide sequences with appropriate weights $\prod_{i=1}^{N} \tilde{f}_{s_i}$. A foreign peptide sequence $\vec{s}$ is then recognized if the optimal binding energy (Eq. (17)) is below the recognition threshold, i.e. if binding is strong enough (Eq. (18)). The numerical result for uncorrelated TCRs (blue data points) agrees very well with the analytical results (red line), which indicates that the correlations in selected TCR sequences (black data points) have an important role.
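The complete uncorrelated pool lends itself to a compact numerical check. The sketch below verifies, on a hypothetical two-letter alphabet with made-up energies `J`, that a brute-force minimum over all TCR sequences reproduces the form $E_{c,\min} + \sum_i J_{\min}(s_i)$ of Eq. (17), and then solves Eq. (19) for $\beta^*$ by bisection before applying Eq. (20). The bisection bracket `hi` and all function names are assumptions of this sketch, and the target $(E_n - E_{c,\min})/N$ is assumed to be reachable.

```python
import itertools
import math

ALPHABET = "AB"
# Hypothetical energies J[(tcr_letter, peptide_letter)] in units of k_B T.
J = {("A", "A"): -1.0, ("A", "B"): -2.0, ("B", "A"): -3.0, ("B", "B"): -0.5}
J_MIN = {a: min(J[b, a] for b in ALPHABET) for a in ALPHABET}  # min_b J(b, a)

def optimal_energy(peptide, e_c_min):
    """Eq. (17): E_c,min + sum_i J_min(s_i)."""
    return e_c_min + sum(J_MIN[s] for s in peptide)

def brute_force_energy(peptide, e_c_min):
    """Minimum of Eq. (16) over all TCR sequences, taken at E_c = E_c,min."""
    return min(e_c_min + sum(J[t, s] for t, s in zip(tcr, peptide))
               for tcr in itertools.product(ALPHABET, repeat=len(peptide)))

def mean_j_min(beta, f):
    """Right-hand side of Eq. (19): Boltzmann average of J_min."""
    w = {a: f[a] * math.exp(-beta * J_MIN[a]) for a in f}
    return sum(w[a] * J_MIN[a] for a in f) / sum(w.values())

def solve_beta(target, f, hi=50.0, tol=1e-10):
    """Bisection for beta* >= 0 with <J_min> = target; 0 if already satisfied."""
    if mean_j_min(0.0, f) <= target:
        return 0.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_j_min(mid, f) > target:
            lo = mid  # average still too weak: increase beta
        else:
            hi = mid
    return 0.5 * (lo + hi)

def recognized_frequencies(beta, f):
    """Eq. (20): reweighted frequencies of recognized peptides."""
    w = {a: f[a] * math.exp(-beta * J_MIN[a]) for a in f}
    z = sum(w.values())
    return {a: w[a] / z for a in f}

f_toy = {"A": 0.5, "B": 0.5}           # placeholder pathogen frequencies
beta_star = solve_beta(-2.8, f_toy)    # target = (E_n - E_c,min) / N
f_rec = recognized_frequencies(beta_star, f_toy)
```

When the unconstrained average $\langle J_{\min} \rangle$ already lies below the target, `solve_beta` returns 0, which corresponds to the case reported in the text where every foreign peptide is recognized.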

Escape of autoimmune T cells

Thymic selection is not perfect and autoimmune T cells, which interact strongly with self pMHCs, can escape from the thymus. Due to stochastic effects, it may happen that a diffusing T cell in the thymus never interacts with some peptides that would lead to negative selection. Also, even if a TCR binds strongly to a self-pMHC, it can escape with some probability because the negative selection threshold is not sharp. Here we only focus on the latter effect, which can be modeled with a soft threshold for negative selection. For a TCR $\vec{t}$ that interacts with self-peptide $\vec{s}$, the probability of negative selection is assumed to be

$$P_n(\vec{t}, \vec{s}) = \frac{1}{1 + \exp\left[\left(E_{int}(\vec{t}, \vec{s}) - E_n\right)/\sigma_n\right]}, \qquad (21)$$

where the parameter $\sigma_n$ denotes the softness of the negative selection threshold. For a TCR that interacts strongly with self-pMHC ($E_{int} < E_n$) the probability of negative selection is close to 1, while for a TCR that interacts with self-pMHC weakly ($E_{int} > E_n$) the probability of negative selection is small. Similarly we can define the probability for positive selection $P_p$ with the corresponding softness $\sigma_p$. From experiments [12] we know that the threshold for positive selection is softer ($\sigma_p > \sigma_n$). In this case thymic selection is modeled by testing each TCR sequence with all self-pMHCs: for each self-peptide we calculate the corresponding probabilities $P_p$ and $P_n$ of positive and negative selection, then we draw two uniformly distributed random numbers $r_p$ and $r_n$ from the (0,1) interval and a TCR is positively (negatively) selected if $r_p < P_p$ ($r_n < P_n$). After thymic selection is completed, we check if any of the selected TCRs interacts strongly with any self-peptide ($E_{int} < E_r = E_n$). Deterministic criteria are now used because strong interaction free energy leads to high probability of T cell activation. Any other deterministic or stochastic criteria would not qualitatively change the results. We find that the introduction of soft thresholds for positive and negative selection does not qualitatively change the results reported earlier regarding the composition of selected TCRs, etc. (Fig. 7a and data not shown). Fig. 7b shows that increasing the softness of the threshold for negative selection $\sigma_n$ increases the chance of escape of autoimmune TCRs. This is because strongly interacting TCRs are negatively selected with lower probability when the threshold for negative selection is softer. Interestingly, the ratio of the number of autoimmune T cells to the number of selected T cells seems to be roughly constant with the number $M$ of self-peptides used during the development in the thymus (Fig. 7c). The fraction of autoimmune T cells increases with $M$, but the rate of increase is small for large $M$. Note that with increasing number of self peptides $M$ both the numerator and the denominator are decreasing, but the ratio is roughly constant.
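The stochastic selection procedure with soft thresholds can be sketched as follows. The sign convention is chosen so that strong binding ($E_{int} < E_n$) gives a negative-selection probability close to 1, as the text describes. The sketch assumes that a TCR survives if it is positively selected by at least one self peptide and negatively selected by none, and that $P_p$ has the same logistic form with $E_p$ and $\sigma_p$; the function names are hypothetical.

```python
import math
import random

def soft_step(x):
    """Numerically stable 1 / (1 + exp(x))."""
    if x > 0:
        z = math.exp(-x)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(x))

def p_negative(e_int, e_n, sigma_n):
    """Soft-threshold probability of negative selection, cf. Eq. (21)."""
    return soft_step((e_int - e_n) / sigma_n)

def p_positive(e_int, e_p, sigma_p):
    """Assumed analogous soft threshold for positive selection."""
    return soft_step((e_int - e_p) / sigma_p)

def survives_selection(rng, energies, e_n, e_p, sigma_n, sigma_p):
    """One stochastic pass of a TCR over its self-peptide interaction energies."""
    positively_selected = False
    for e_int in energies:
        if rng.random() < p_negative(e_int, e_n, sigma_n):
            return False  # negatively selected by this self peptide
        if rng.random() < p_positive(e_int, e_p, sigma_p):
            positively_selected = True
    return positively_selected

def is_autoimmune(energies, e_n):
    """Deterministic check after selection: any self interaction below E_r = E_n."""
    return min(energies) < e_n

rng = random.Random(2)
# Placeholder energies of one TCR with three self peptides (units of k_B T).
energies = [-15.2, -13.0, -11.0]
escaped = (survives_selection(rng, energies, -15.0, -12.5, 0.1, 1.0)
           and is_autoimmune(energies, -15.0))
```

Repeating this for many TCRs and dividing the number of escaped autoimmune TCRs by the number of selected TCRs gives the ratio plotted in Fig. 7b and 7c; in the limit of vanishing softness the procedure reduces to sharp thresholds.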



DISCUSSION 

In this paper we extend our understanding of the prob- 
lem of how the thymus designs a T cell repertoire that 
is both specific and degenerate for pathogenic peptide 
recognition. Previously T^- 18 1 we argued that for selec- 
tion against many self-peptides, negative selection im- 
poses a constraint which results in selected TCR se- 
quences composed of predominantly weakly interacting 
amino acids. We now find additionally that negative se- 
lection also results in selected TCRs that bind relatively 
weakly to MHC. But, interactions with MHC cannot be 
arbitrarily weak as that would prevent proper TCR ori- 
entation on MHC and peptide recognition. 

Binding of such TCRs to pathogenic peptides is suffi- 
ciently strong for recognition only when pathogenic pep- 
tides are composed of predominantly strongly interacting 
amino acids. This is not too restrictive for the immune 
system, because several pathogenic peptides are derived 
from pathogenic proteins and presented to T cells. It 
is enough for T cells to recognize just a few pathogenic 
peptides, to activate the immune system and clear the 
infection. This may contribute to why there are only a 
few immunodominant peptides corresponding to any in- 
fection. 

Equations (15) and (20) provide an analytical expression that captures the characteristics of amino acids of the recognized foreign peptides. Figures 5 and 6 show that the analytical result is not very accurate for short ($N = 5$) peptide sequences and we showed (Figs. 5 and 6) that this is due to the correlations in selected TCR sequences. In the future it would be interesting to study how (or if) these correlations vanish as peptide sequence length ($N$) is increased.

[Figure 7 plot. Abscissa labels: number of important contacts for TCR recognition of pathogenic peptides (a); $\sigma_n$ (b); $M$ (c).]

FIG. 7: Thymic selection with soft thresholds for positive and negative selection. a) Histogram of the number of important contacts (defined in text) with which T cells recognize pathogenic peptides. This result and other characteristics of selected TCRs (data not shown) are qualitatively equivalent to the results obtained with sharp thresholds for positive and negative selection (Figs. [T] and [3]). b) Ratio of the numbers of escaped autoimmune TCRs and selected TCRs as a function of the softness of the threshold for negative selection $\sigma_n$. The fraction of autoimmune TCRs increases with $\sigma_n$, until the softness of the thresholds becomes of the same order as the separation between the thresholds of positive and negative selection, $\sigma_n + \sigma_p \sim E_p - E_n$. $\sigma_p = 1\,k_BT$, $M = 10^3$. c) Ratio of the numbers of escaped autoimmune TCRs and selected TCRs as a function of the number of types of self peptides ($M$). The fraction of autoimmune TCRs increases with $M$ and is roughly constant for large $M$; however, the absolute numbers of TCRs decrease with $M$. $\sigma_p = 1\,k_BT$, $\sigma_n = 0.1\,k_BT$. The error bars in (b) and (c) correspond to the standard deviation of the fractions of escaped TCRs obtained from repeating the thymic selection process many times.

It is known that people who express a particular MHC type called HLA-B57 are more likely to control HIV infection than people without this MHC [25, 26]. In a previous study [18] we found that HLA-B57 binds 6 times fewer peptides than MHC molecules that are associated with faster progression to AIDS. This means that TCRs restricted for HLA-B57 are selected against fewer types of self peptides in the thymus, which results in a more cross-reactive T cell repertoire (Fig. and Fig. 7a). In that study we showed that a more cross-reactive T cell repertoire could contribute to better control of HIV infection. Interestingly, people expressing HLA-B57 are also more prone to autoimmune diseases [27], [28]. We also studied the escape of autoimmune T cells from the thymus as a function of the number of self peptides $M$. While the ratio of escaped autoimmune T cells to selected T cells is roughly independent of $M$ in the relevant regime ($M \sim$ a few thousand, Fig. 7c), the absolute number of escaped autoimmune T cells is higher when $M$ is smaller. This implies that the rate of escape of autoimmune T cells could be higher in people expressing HLA-B57.

Note, however, that allowing escape of autoimmune TCRs by having a soft threshold for negative selection does not alter our qualitative results regarding the characteristics of selected TCRs, and the origins of specific and degenerate TCR recognition of pathogens.